Introduction

Baseball is America’s greatest pastime and one of the most storied sports in our history. But what if I told you there was a secret league with some of the greatest baseball players most of us have never heard of? This secret league was the NLB, Negro League Baseball, a collection of smaller leagues of Black players during segregation in America. These leagues started around the 1920s and ended around the 1950s, when many players were finally accepted into the MLB. I will explore some of these forgotten players and some of the statistics that have been hidden for many years.

Data

https://github.com/fivethirtyeight/negro-leagues-player-ratings

This GitHub repository holds the dataset. This analysis will use it to tell the story and stats of many forgotten baseball stars.

Barrier of entry:

Negro Leagues: 150 games as a batter or 60 games + starts as a pitcher

MLB: 300 games as a batter or 350 games + starts as a pitcher

The MLB players include both current players and Hall of Fame players
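The cutoffs listed above can be read back from the data itself. Below is a minimal sketch, assuming the gameCutoff column stores the barrier that applied to each player; the data frame `players` and its rows are made up for illustration and are not taken from the CSV.

```r
library(dplyr)

# Hypothetical rows for illustration only; the real values come from the CSV.
players <- data.frame(
  commonName = c("Player A", "Player B", "Player C", "Player D"),
  league     = c("NLB", "NLB", "MLB", "MLB"),
  position   = c("Batter", "Pitcher", "Batter", "Pitcher"),
  gameCutoff = c(150, 60, 300, 350)
)

# One row per league/position pair with the cutoff that applies to it
cutoffs <- players %>%
  distinct(league, position, gameCutoff) %>%
  arrange(league, position)

print(cutoffs)
```

Running the same pipeline on the full table would be a quick way to confirm that the cutoffs in the data match the ones stated here.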

The data comes from FiveThirtyEight, and some of it was sourced from the Seamheads Negro Leagues database (https://www.seamheads.com/NegroLgs/), a collection of NLB statistics.

The goal of our analysis

Libraries

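library(tidyverse) # data manipulation; also attaches dplyr, ggplot2, and readr
library(dplyr)     # data manipulation verbs (already attached by tidyverse)
library(ggplot2)   # visualization (already attached by tidyverse)
library(plotly)    # interactive graphs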

Our Table

library(readr)
RawNLBandMLB <- read_csv("negro-leagues-player-ratings.csv")

glimpse(RawNLBandMLB)
## Rows: 1,117
## Columns: 25
## $ playerID     <chr> "culbech01", "gosseph01", "herrmch01", "kratzer01", "pire…
## $ commonName   <chr> "Charlie Culberson", "Phil Gosselin", "Chris Herrmann", "…
## $ league       <chr> "MLB", "MLB", "MLB", "MLB", "MLB", "MLB", "MLB", "MLB", "…
## $ hof          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ startYear    <dbl> 2012, 2013, 2012, 2010, 2014, 2015, 2011, 2014, 2008, 201…
## $ endYear      <dbl> 2020, 2020, 2019, 2020, 2019, 2019, 2019, 2019, 2019, 202…
## $ totalGames   <dbl> 428, 359, 370, 335, 302, 326, 461, 419, 386, 313, 376, 48…
## $ positionWar  <dbl> -0.620, 0.895, -1.150, 1.715, 0.545, 1.310, -1.555, 4.340…
## $ averageHit   <dbl> 41.791451, 72.992105, 3.648244, 21.236047, 67.574190, 10.…
## $ patience     <dbl> 13.776205, 28.641438, 70.106180, 19.112442, 18.976314, 24…
## $ power        <dbl> 41.709774, 16.879935, 44.105636, 69.670569, 37.244759, 9.…
## $ speed        <dbl> 64.524912, 58.562483, 75.850803, 1.334059, 78.872856, 81.…
## $ defense      <dbl> 24.25810, 44.89518, 36.48244, 99.59161, 38.95998, 90.4982…
## $ gameCutoff   <dbl> 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 30…
## $ playerLabel  <chr> "Active Player", "Active Player", "Active Player", "Activ…
## $ shortWar     <dbl> -0.2346729, 0.4038719, -0.5035135, 0.8293433, 0.2923510, …
## $ positionCat  <chr> "Outfielder", "Middle IF", "Catcher", "Catcher", "Middle …
## $ position     <chr> "Batter", "Batter", "Batter", "Batter", "Batter", "Batter…
## $ careerStarts <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ strikeOuts   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ control      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ fip          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ whip         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ era          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ fact         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Variables

Filtering

There are a lot of variables in this dataset that we don’t need within the scope of our analysis, so we will select only the columns we need. We will then split the data into an NLB set and an MLB set to use if needed.

NLBandMLB <- RawNLBandMLB %>%
  select(playerID, commonName, league, hof, startYear, endYear, totalGames,
         positionWar, averageHit, defense, gameCutoff, playerLabel, shortWar,
         positionCat, position, era)

NLB <- NLBandMLB %>% filter(league == 'NLB')

MLB <- NLBandMLB %>% filter(league == 'MLB')

glimpse(NLBandMLB)
## Rows: 1,117
## Columns: 16
## $ playerID    <chr> "culbech01", "gosseph01", "herrmch01", "kratzer01", "pirel…
## $ commonName  <chr> "Charlie Culberson", "Phil Gosselin", "Chris Herrmann", "E…
## $ league      <chr> "MLB", "MLB", "MLB", "MLB", "MLB", "MLB", "MLB", "MLB", "M…
## $ hof         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ startYear   <dbl> 2012, 2013, 2012, 2010, 2014, 2015, 2011, 2014, 2008, 2018…
## $ endYear     <dbl> 2020, 2020, 2019, 2020, 2019, 2019, 2019, 2019, 2019, 2020…
## $ totalGames  <dbl> 428, 359, 370, 335, 302, 326, 461, 419, 386, 313, 376, 489…
## $ positionWar <dbl> -0.620, 0.895, -1.150, 1.715, 0.545, 1.310, -1.555, 4.340,…
## $ averageHit  <dbl> 41.791451, 72.992105, 3.648244, 21.236047, 67.574190, 10.8…
## $ defense     <dbl> 24.25810, 44.89518, 36.48244, 99.59161, 38.95998, 90.49823…
## $ gameCutoff  <dbl> 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300…
## $ playerLabel <chr> "Active Player", "Active Player", "Active Player", "Active…
## $ shortWar    <dbl> -0.2346729, 0.4038719, -0.5035135, 0.8293433, 0.2923510, 0…
## $ positionCat <chr> "Outfielder", "Middle IF", "Catcher", "Catcher", "Middle I…
## $ position    <chr> "Batter", "Batter", "Batter", "Batter", "Batter", "Batter"…
## $ era         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

This data is a lot more condensed and ready for our analysis.

Could an NLB player even compete in the majors?

First, let’s see whether any players have statistics from both the MLB and the NLB, so we can check whether their stats are similar in both leagues.

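# Find players that have entries in both the NLB and MLB subsets (matched by name)
duplicatedData <- inner_join(x = NLB, y = MLB, by = "commonName") %>%
  select(commonName) %>%
  distinct() # one row per shared name, so the join below cannot duplicate rows

# Keep every row, from either league, that belongs to one of those players
PlayersInBothLeagues <- inner_join(NLBandMLB, duplicatedData, by = "commonName")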

ggplot(PlayersInBothLeagues, aes(x = commonName, y = shortWar, fill = league)) +
  geom_col(position = "dodge")

Only three players in this dataset had stats in both the MLB and NLB, so let’s compare their shortWar across leagues. The most significant difference is in Sam Crawford’s stats, as his shortWar jumped almost 4 points. What these stats don’t show is that the two Sam Crawford entries list different positions, a pitcher in the NLB and an outfielder in the MLB, which makes the WAR stats jump a lot, so his transfer is hard to compare. Because the join matches on commonName alone, it is also worth confirming that both entries describe the same person. For Roy and Monte, we can see that Roy was almost the same level of player he was in the NLB, and Monte dropped from an almost MVP-level player in the NLB to an All-Star in the MLB. While this sample size is minimal, it suggests that the competition in both leagues was somewhat comparable.
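Since the join above matches players on commonName, two different people who share a name would be treated as one player. The sketch below shows one possible sanity check, assuming that a single player is unlikely to have NLB and MLB careers whose years overlap. The data frames `nlb` and `mlb`, the IDs, and the years are all made up for illustration, and the overlap rule is only a heuristic for flagging rows worth a closer look.

```r
library(dplyr)

# Made-up rows: "Same Name" stands in for two namesakes,
# "Other Name" for one player who moved from the NLB to the MLB.
nlb <- data.frame(
  playerID   = c("n001", "n002", "n003"),
  commonName = c("Same Name", "Other Name", "Only NLB"),
  startYear  = c(1920, 1938, 1925),
  endYear    = c(1930, 1948, 1935)
)
mlb <- data.frame(
  playerID   = c("m001", "m002", "m003"),
  commonName = c("Same Name", "Other Name", "Only MLB"),
  startYear  = c(1925, 1949, 1950),
  endYear    = c(1940, 1956, 1960)
)

# Join on the name, then flag pairs whose career spans overlap in time
matches <- inner_join(nlb, mlb, by = "commonName", suffix = c("_nlb", "_mlb")) %>%
  mutate(careersOverlap = startYear_nlb <= endYear_mlb &
                          startYear_mlb <= endYear_nlb)

print(matches)
```

A flagged row does not prove the entries are different people; it only marks a match that deserves a manual look at the playerID values and career dates.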

Is the distribution of WAR similar across both leagues?

# creating a box plot to show the distribution of WAR across leagues 
ggplot(NLBandMLB, mapping = aes(league, shortWar, fill = league)) +
  geom_boxplot()

While this is a very broad question, the box plot shows that the Negro League players, on the whole, rated a little lower than the MLB players, but there may be a hidden reason why the NLB box sits lower.
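The box plot can be backed up with the numbers it draws. This is a minimal sketch of that summary: the data frame `war` holds made-up shortWar values, so the medians it prints are not results from the dataset.

```r
library(dplyr)

# Made-up shortWar values for illustration only
war <- data.frame(
  league   = c("NLB", "NLB", "NLB", "MLB", "MLB", "MLB"),
  shortWar = c(-1.0, 1.0, 3.0, 0.5, 2.0, 3.5)
)

# Median and quartiles per league: the same quantities a box plot displays
warSummary <- war %>%
  group_by(league) %>%
  summarise(
    players   = n(),
    q1        = unname(quantile(shortWar, 0.25)),
    medianWar = median(shortWar),
    q3        = unname(quantile(shortWar, 0.75)),
    .groups   = "drop"
  )

print(warSummary)
```

Swapping `war` for NLBandMLB would give the actual medians behind the two boxes.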

Who were the very best in the NLB and how do they compare to the MLB?

ggplot(NLBandMLB, aes(shortWar, positionWar, color = league)) +
  geom_point()

This graph plots every player in the dataset, colored by league. positionWar is notably higher for the MLB players. positionWar accumulates with games played, and the top MLB players have over 2,000 games played while the Negro League players cap out at around half of that. While the spread across shortWar is pretty even, what interests me the most is the number of Negro League players who are negative in both positionWar and shortWar. The reason is the same one that pulled the box plot above downward for the Negro Leagues: the barrier of entry for this dataset is a lot lower for Negro League players. This is not a complete view of every MLB player. Many MLB players have negative career WAR too, but they didn’t play enough games in the MLB to be counted in this set, whereas some NLB players who were in and out of the league pretty quickly still played long enough to qualify.

Let’s look at the player in the bottom left: Percy Forrest. Percy was a pitcher and an outfielder. Across his six seasons as a pro, Percy averaged only six games a season, while the NLB season had 81 games. His WAR stats show a weak player who didn’t play much but stuck around the league long enough to qualify for this dataset. Under the MLB’s qualification of around 300 games, it’s unlikely that someone who lasted that long would be nearly as bad as Percy was.
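One way to test the barrier-of-entry explanation is to hold both leagues to the same cutoff and compare again. The sketch below does this for batters with a 300-game minimum. The data frame `players` and every value in it are made up, so the output only illustrates the method. On the real data this filter would also remove most NLB players, since their seasons were much shorter.

```r
library(dplyr)

# Made-up batters for illustration only
players <- data.frame(
  commonName = c("A", "B", "C", "D", "E"),
  league     = c("NLB", "NLB", "NLB", "MLB", "MLB"),
  position   = c("Batter", "Batter", "Batter", "Batter", "Batter"),
  totalGames = c(160, 320, 450, 310, 900),
  shortWar   = c(-1.5, 2.0, 4.0, 1.0, 3.0)
)

# Apply the MLB batter cutoff (300 games) to every player, whatever the league
sameBar <- players %>%
  filter(position == "Batter", totalGames >= 300)

# Compare the leagues again on the equal-footing subset
medians <- sameBar %>%
  group_by(league) %>%
  summarise(players = n(), medianShortWar = median(shortWar), .groups = "drop")

print(medians)
```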

Who are the superstars in the NLB?

NLBinteractive <- plot_ly(NLB, x = ~shortWar, y = ~positionWar, type = 'scatter', mode = 'markers',
        text = ~paste('Name ', commonName))

NLBinteractive

With this graph, we can see some of the best players in the Negro Leagues. We see names like Josh Gibson, Dobbie Moore, and Charlie Smith, who put up shortWar stats similar to those of Babe Ruth, widely regarded as the best baseball player ever, and yet these are names many of us have never heard of. These three players have a shortWar over ten, which places them among the best baseball players ever.

Where would the NLB batters rank all time?

NLBandMLB %>%
  select(commonName, league, averageHit, shortWar, hof) %>%
  arrange(desc(averageHit)) %>%
  slice_head(n = 20)
## # A tibble: 20 × 5
##    commonName       league averageHit shortWar   hof
##    <chr>            <chr>       <dbl>    <dbl> <dbl>
##  1 Ty Cobb          MLB         100       8.02     1
##  2 Charlie Smith    NLB         100      10.3      0
##  3 Nap Lajoie       MLB          99.9     7.09     1
##  4 Ed Delahanty     MLB          99.9     7.97     1
##  5 Ted Williams     MLB          99.9     8.92     1
##  6 Rogers Hornsby   MLB          99.9     9.23     1
##  7 Tris Speaker     MLB          99.8     7.70     1
##  8 Rod Carew        MLB          99.8     5.05     1
##  9 Tony Gwynn       MLB          99.7     4.45     1
## 10 Josh Gibson      NLB          99.7    10.9      1
## 11 George Sisler    MLB          99.7     4.18     1
## 12 Wade Boggs       MLB          99.6     5.96     1
## 13 Honus Wagner     MLB          99.6     8.25     1
## 14 Stan Musial      MLB          99.6     6.83     1
## 15 Harry Heilmann   MLB          99.5     5.31     1
## 16 Roberto Clemente MLB          99.5     5.84     1
## 17 Eddie Collins    MLB          99.4     7.00     1
## 18 Heavy Johnson    NLB          99.4     5.42     0
## 19 Jose Altuve      MLB          99.4     4.50     0
## 20 Babe Ruth        MLB          99.3    11.1      1

First, let’s look at the batting statistics for players in both leagues; remember, these batting statistics are percentiles. The 100th percentile holds the best batters of all time, and one of them is Ty Cobb, a very famous player regarded as one of the best. But right under Ty, we have Charlie Smith, a Negro League player in the 100th percentile of batters whom many people have never heard of. Comparing just these two players, Charlie’s shortWar is about 2 points better than Ty’s. While NLB players are not abundant in this top 20, we also see Josh Gibson in the 99th percentile, and the only player in the top 20 batters with a higher shortWar than either of these NLB players is Babe Ruth, which makes their statistics an amazing feat.

ggplot(NLBandMLB, aes(averageHit, fill = league)) +
  geom_histogram(binwidth = 10) +
  facet_grid(~league)

This graph shows the number of players in each percentile band for batting, split by league. As we can see, there are around 20 NLB players in the top band, at roughly the 95th percentile and above. While that is not a massive number of players, it is still a significant number of players who have been forgotten throughout history. While we speak about Babe Ruth, Ty Cobb, and Hank Aaron, we could add 20 names of Negro League players to that conversation.
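The count read off the histogram can also be computed directly. The sketch below assumes a threshold of 95 on averageHit; the data frame `batters` and its values are made up, and the NA row stands in for a pitcher with no batting percentile.

```r
library(dplyr)

# Made-up percentiles for illustration only; NA mimics a pitcher row
batters <- data.frame(
  commonName = c("A", "B", "C", "D", "E", "F", "G"),
  league     = c("NLB", "NLB", "NLB", "NLB", "MLB", "MLB", "MLB"),
  averageHit = c(99.7, 96.0, 80.2, NA, 100, 95.0, 50.5)
)

# Count players at or above the 95th percentile in each league.
# filter() drops rows where the comparison is NA, so pitchers are excluded.
topHitters <- batters %>%
  filter(averageHit >= 95) %>%
  count(league, name = "players")

print(topHitters)
```

Running this on NLBandMLB would replace the "around 20" estimate with an exact count per league.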

Well, who was pitching to these batters?

This is a great question to ask because the batting stats for the Negro League players would not be as impactful if the pitchers in the Negro Leagues were terrible, so let’s look and see.

NLBandMLB %>%
  select(commonName, era, shortWar, league, hof, playerLabel, position) %>%
  filter(era > 90) %>%
  ggplot() +
  geom_bar(aes(playerLabel, fill = position)) +
  ggtitle("ERA 90th percentile")

In this graph, we can see the number of players above the 90th percentile of ERA in the three main categories of our dataset. Around 20 Negro League pitchers fall in this group, showing that there were excellent pitchers in the Negro Leagues and making the stats that Charlie Smith and Josh Gibson put up just as impressive as those of Babe Ruth and Ty Cobb. In this graph, we also see a massive influx of active players above the 90th percentile. The reasoning for this is a long story, but it’s a combination of many things. A lot of great pitchers have entered the league. There have been advances in analytics and in how to pitch. There’s been a lot of cheating to improve things like spin rates, and there have been many rule changes that favor pitchers. The MLB is currently changing rules and cracking down on cheating, which could lower a lot of these statistics for the active players in the MLB.

How did their WAR compare to their counterparts?

ggplot(NLBandMLB, aes(era, shortWar, color = league)) +
  geom_point() +
  geom_smooth() +
  ggtitle("ERA and WAR of pitchers")

The graph above shows the shortWar for the pitchers and the percentile they’re in for ERA. We can see that, on average, from the 50th percentile and up, the shortWar for the pitchers in the NLB was better than that of their MLB counterparts. While I don’t think all the Negro League pitchers were better than the MLB pitchers, I think this graph shows that there were excellent pitchers in the Negro Leagues, which makes the batting stats of many of the Negro League players even more impressive.
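The trend lines can be cross-checked with a simple table of averages. The sketch below splits pitchers at the 50th ERA percentile and compares the mean shortWar per league; the data frame `pitchers` and all of its values are made up, so it demonstrates the calculation and not the finding.

```r
library(dplyr)

# Made-up pitchers for illustration only
pitchers <- data.frame(
  league   = c("NLB", "NLB", "NLB", "NLB", "MLB", "MLB", "MLB", "MLB"),
  era      = c(20, 60, 80, 95, 30, 55, 85, 99),
  shortWar = c(0.5, 2.5, 4.0, 6.0, 0.8, 2.0, 3.0, 5.0)
)

# Mean shortWar by league, below and above the 50th ERA percentile
byHalf <- pitchers %>%
  filter(!is.na(era)) %>%  # batters have no ERA percentile
  mutate(eraHalf = if_else(era >= 50, "50th and up", "below 50th")) %>%
  group_by(league, eraHalf) %>%
  summarise(pitchers = n(), meanShortWar = mean(shortWar), .groups = "drop")

print(byHalf)
```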

Who were these pitchers?

NLBinteractivePitchers <- plot_ly(NLB, x = ~era, y = ~shortWar, type = 'scatter', mode = 'markers',
        text = ~paste('Name: ', commonName))

NLBinteractivePitchers

This graph gives the names of many of the greatest pitchers in the Negro Leagues, like Satchel Paige, Jose Leblanc, and Martin Dihigo: three pitchers who should be part of the conversation about the greatest pitchers in the history of the game.

Summary

After looking at and comparing the data, I believe it was right for the MLB to recognize these players and add their stats to the MLB record, as they faced very similar competition, and many who came out of the Negro Leagues were able to produce at similar if not higher levels in the MLB than they did while in the Negro Leagues. I think it’s essential that we push the story of many of these Negro League players and shine a light on the league that was left in the darkness for many years.